Topic Modeling in Social Science
Latent Dirichlet Allocation
We will start with simple Latent Dirichlet Allocation (LDA) with Gibbs sampling. This part is adapted from Ethen Liu's intuitive demo of LDA. For a detailed intuitive or technical introduction, please check our lecture slides or Blei's article. You can also check here for an introduction to Gibbs sampling.
Latent Dirichlet Allocation (LDA) is a probabilistic topic modeling method that allows us to discover and annotate the thematic structure of texts. The key assumptions are as follows (see Mohr and Bogdanov 2013).
Each document (text) within a corpus is viewed as a bag-of-words produced according to a mixture of themes that the author of the text intended to discuss. Each theme (or topic) is a distribution over all observed words in the corpus, such that words that are strongly associated with the document’s dominant topics have a higher chance of being selected and placed in the document bag. Thus, the goal of topic modeling is to find the parameters of the LDA process that has likely generated the corpus.
Based on this week’s reading (Blei 2012), we know that the topic distribution for each document is
\[ \theta \sim Dirichlet(\alpha) \]
Where \(Dirichlet(\alpha)\) denotes the Dirichlet distribution for parameter \(\alpha\).
The word distribution for each topic is also modeled by a Dirichlet distribution, with a different parameter \(\eta\).
\[ \beta \sim Dirichlet(\eta) \]
Our goal is to estimate \(\theta\) and \(\beta\) using the observed words in documents. That is, we are trying to learn which words are important for which topic and which topics are important for a particular document, respectively.
Note that the Dirichlet distribution is a probability distribution parameterized by \(\alpha\), which governs the concentration of the distribution; it is sometimes called the concentration or scaling parameter. When \(\alpha\) approaches 0, documents are concentrated in a few topics, while a higher value means topics are more evenly distributed across documents. The same logic applies to \(\eta\) for the topic-word distribution \(\beta\).
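To build intuition for the concentration parameter, here is a small illustration in base R. It uses the standard construction of a Dirichlet draw as normalized independent Gamma draws (the helper `rdirichlet_one` is our own, not from a package):

```r
# Draw one sample from Dirichlet(alpha) by normalizing Gamma draws
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(42)
# small alpha: probability mass concentrates on a few topics
concentrated <- rdirichlet_one(rep(0.1, 5))
# large alpha: probability mass spreads evenly across topics
spread <- rdirichlet_one(rep(10, 5))

round(concentrated, 2)
round(spread, 2)
```

Running this a few times, the small-\(\alpha\) draws tend to put almost all mass on one or two entries, while the large-\(\alpha\) draws stay close to uniform.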
We will use Gibbs sampling to compute the conditional probability specified in Blei’s article (eq 2). Generally speaking, LDA is a generative model of word counts. We are interested in the conditional probability of hidden topic structure given the observed words in documents.
To simplify the demo, we will use 10 short strings to represent 10 documents. (Note that recent studies show that document length and the number of documents do influence results, so be careful about this.) We deliberately took 5 sentences describing Chinese food and 5 sentences describing American football from Wikipedia.
Usually, before running a topic model, we need to normalize our texts as shown in our lecture (tidying text, removing stop words, white space, etc.). We often use the tidytext, tm, or quanteda packages in R to preprocess texts, but for now let us stick to basic tools. I strongly suggest you take some time to read the quanteda tutorial.
raw_docs <- c(
"Chinese cuisine is an important part of Chinese culture, which includes cuisine originating from the diverse regions of China, as well as from Overseas Chinese who have settled in other parts of the world.",
"The preference for seasoning and cooking techniques of Chinese provinces depend on differences in historical background and ethnic groups.",
"Chinese society greatly valued gastronomy, and developed an extensive study of the subject based on its traditional medical beliefs.",
"There are a variety of styles of cooking in China, but Chinese chefs have classified eight regional cuisines according to their distinct tastes and local characteristics. ",
"Based on the raw materials and ingredients used, the method of preparation and cultural differences, a variety of foods with different flavors and textures are prepared in different regions of the country. ",
"American football, referred to as football in the United States and Canada and also known as gridiron,is a team sport played by two teams of eleven players on a rectangular field with goalposts at each end",
"American football evolved in the United States, originating from the sports of soccer and rugby. The first American football match was played on November 6, 1869, between two college teams, Rutgers and Princeton, using rules based on the rules of soccer at the time.",
"American football is the most popular sport in the United States. The most popular forms of the game are professional and college football, with the other major levels being high school and youth football. ",
"In football, the winner is the team that has scored more points at the end of the game. There are multiple ways to score in a football game. ",
"Football games last for a total of 60 minutes in professional and college play and are divided into two halves of 30 minutes and four quarters of 15 minutes."
)
# lowercase the text and remove punctuation
raw_docs <- stringr::str_replace_all(tolower(raw_docs),"[:punct:]","")
# remove stop words
stopwords_regex = paste(stopwords::stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
raw_docs <- stringr::str_replace_all(raw_docs,stopwords_regex, '')
# remove the most frequent words, chinese,american, football
raw_docs <- stringr::str_replace_all(raw_docs,"chinese|american|football", '')
raw_docs[[1]]
## [1] " cuisine important part culture includes cuisine originating diverse regions china well overseas settled parts world"
# let us squish our text, removing extra spaces
raw_docs <- stringr::str_squish(raw_docs)
# segmenting each word, similar to tokenization
docs <- strsplit(raw_docs, split = " ")
docs[[1]]
## [1] "cuisine" "important" "part" "culture" "includes"
## [6] "cuisine" "originating" "diverse" "regions" "china"
## [11] "well" "overseas" "settled" "parts" "world"
# get a vocabulary of unique words in our corpus
vocab <- unique( unlist(docs) )
# represent strings using numerical numbers
# use the base match function match(x,table)
# If x[i] is found to equal table[j] then the value returned in the i-th position of the return value is j, for the smallest possible j.
for( i in 1:length(docs) ) {
docs[[i]] <- match( docs[[i]], vocab )
}
docs
## [[1]]
## [1] 1 2 3 4 5 1 6 7 8 9 10 11 12 13 14
##
## [[2]]
## [1] 15 16 17 18 19 20 21 22 23 24 25
##
## [[3]]
## [1] 26 27 28 29 30 31 32 33 34 35 36 37
##
## [[4]]
## [1] 38 39 17 9 40 41 42 43 44 45 46 47 48 49
##
## [[5]]
## [1] 34 50 51 52 53 54 55 56 21 38 57 58 59 60 61 58 8 62
##
## [[6]]
## [1] 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
##
## [[7]]
## [1] 81 64 65 6 82 83 84 85 86 72 87 88 89 73 90 74 91 92 93 94 34 94 83 95
##
## [[8]]
## [1] 96 71 64 65 96 97 98 99 90 100 101 102 103 104
##
## [[9]]
## [1] 105 70 106 107 80 98 108 109 110 98
##
## [[10]]
## [1] 111 112 113 114 115 99 90 116 117 73 118 119 115 120 121 122 115
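To make the word-to-id mapping step concrete, here is a toy example of `match()` with a made-up three-word vocabulary:

```r
# match(x, table) returns, for each element of x, the position of its
# first occurrence in table -- exactly what we need to map words to ids
vocab_toy <- c("cuisine", "football", "game")
tokens    <- c("game", "cuisine", "game")
ids <- match(tokens, vocab_toy)
ids
# [1] 3 1 3
```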
In LDA, we have to specify the number of clusters (i.e., topics) first, usually denoted by K. In this case, let us use 2.
Recall that in Blei's article he describes the generative process of LDA, which has several major steps.
Blei 2012
Here, let us first go through each document and randomly assign each word in the document to one of the K topics. This is the topic assignment process, corresponding to the right side of Figure 1 in Blei's article.
Then we create a word-topic matrix, which counts how often each word is assigned to each topic, and a document-topic matrix, which counts the number of words assigned to each topic in each document.
# cluster number
K <- 2
# initialize count matrices
# @wt : word-topic matrix
wt <- matrix( 0, nrow = K, ncol = length(vocab) )
colnames(wt) <- vocab
# @ta : topic assignment list
ta <- lapply( docs, function(x) rep( 0, length(x) ) )
names(ta) <- paste0( "doc", 1:length(docs) )
# @dt : counts correspond to the number of words assigned to each topic for each document
dt <- matrix( 0, length(docs), K )
set.seed(2020)
for( d in 1:length(docs) ) {
# randomly assign topic to word w
for( w in 1:length( docs[[d]] ) ) {
ta[[d]][w] <- sample(1:K, 1)
# extract the topic index, word id and update the corresponding cell in the word-topic count matrix
ti <- ta[[d]][w]
wi <- docs[[d]][w]
wt[ti, wi] <- wt[ti, wi] + 1
# josh's comments- the initial value for wt[ti,wi] is 0, and now we update it to 1 because we assign a word to that topic. so the count of words increases to 1.
}
# count words in document d assigned to each topic t
# Josh's comment-okay, dt is a container for topic-document count
for( t in 1:K ) {
dt[d, t] <- sum( ta[[d]] == t )
}
}
# randomly assigned topic to each word
print(ta)
## $doc1
## [1] 2 2 1 2 2 1 1 2 2 2 1 1 2 2 2
##
## $doc2
## [1] 2 2 2 2 1 2 2 1 2 1 2
##
## $doc3
## [1] 2 2 2 2 2 2 2 1 2 1 2 2
##
## $doc4
## [1] 2 1 2 2 2 1 2 2 2 1 2 1 1 1
##
## $doc5
## [1] 2 2 2 1 2 1 2 1 2 1 1 2 1 1 1 1 1 1
##
## $doc6
## [1] 1 1 2 2 2 1 1 2 2 1 2 1 2 1 2 1 2 1
##
## $doc7
## [1] 2 2 1 2 2 1 2 2 2 2 1 1 2 1 1 2 2 2 2 1 2 2 2 1
##
## $doc8
## [1] 2 1 1 2 1 2 2 2 1 1 2 2 2 2
##
## $doc9
## [1] 2 2 1 2 1 2 1 1 1 2
##
## $doc10
## [1] 2 2 1 1 1 1 1 1 2 1 2 1 1 1 2 2 1
## cuisine important part culture includes originating diverse regions china
## [1,] 1 0 1 0 0 1 0 1 0
## [2,] 1 1 0 1 1 1 1 1 2
## well overseas settled parts world preference seasoning cooking techniques
## [1,] 1 1 0 0 0 0 0 0 0
## [2,] 0 0 1 1 1 1 1 2 1
## provinces depend differences historical background ethnic groups society
## [1,] 1 0 0 1 0 1 0 0
## [2,] 0 1 2 0 1 0 1 1
## greatly valued gastronomy developed extensive study subject based
## [1,] 0 0 0 0 0 0 1 0
## [2,] 1 1 1 1 1 1 0 3
## traditional medical beliefs variety styles chefs classified eight regional
## [1,] 1 0 0 1 1 0 1 0 0
## [2,] 0 1 1 1 0 1 0 1 1
## cuisines according distinct tastes local characteristics raw materials
## [1,] 0 1 0 1 1 1 0 0
## [2,] 1 0 1 0 0 0 1 1
## ingredients used method preparation cultural foods different flavors
## [1,] 1 0 1 0 1 1 1 1
## [2,] 0 1 0 1 0 0 1 0
## textures prepared country referred united states canada also known
## [1,] 1 1 1 1 2 1 0 0 1
## [2,] 0 0 0 0 1 2 1 1 0
## gridironis team sport played two teams eleven players rectangular field
## [1,] 1 0 1 1 2 1 0 1 0 1
## [2,] 0 2 1 1 1 1 1 0 1 0
## goalposts end evolved sports soccer rugby first match november 6 1869
## [1,] 0 2 0 0 1 0 0 0 1 1 0
## [2,] 1 0 1 1 1 1 1 1 0 0 1
## college rutgers princeton using rules time popular forms game professional
## [1,] 3 0 0 0 1 1 1 0 0 1
## [2,] 0 1 1 1 1 0 1 1 3 1
## major levels high school youth winner scored points multiple ways score
## [1,] 1 0 0 0 0 0 1 0 1 1 1
## [2,] 0 1 1 1 1 1 0 1 0 0 0
## games last total 60 minutes play divided halves 30 four quarters 15
## [1,] 0 0 1 1 3 1 0 0 1 1 0 0
## [2,] 1 1 0 0 0 0 1 1 0 0 1 1
## [,1] [,2]
## [1,] 5 10
## [2,] 3 8
## [3,] 2 10
## [4,] 6 8
## [5,] 11 7
## [6,] 9 9
## [7,] 8 16
## [8,] 5 9
## [9,] 5 5
## [10,] 11 6
Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics, but they are bad ones! We need to improve them.
There are a couple of ways to do this, but we focus on the Gibbs sampling method, which performs the following steps for a user-specified number of iterations:
For each document d, go through each word w and reassign it a new topic: w is assigned to topic t with probability “the probability of word w given topic t” \(\times\) “the probability of topic t given document d”, denoted by the following mathematical notation:
\[ P( z_i = j \text{ }| \text{ } z_{-i}, w_i, d_i ) \propto \frac{ C^{WT}_{w_ij} + \eta }{ \sum^W_{ w = 1 }C^{WT}_{wj} + W\eta } \times \frac{ C^{DT}_{d_ij} + \alpha }{ \sum^T_{ t = 1 }C^{DT}_{d_it} + T\alpha } \]
This formula looks confusing, so let us unpack it bit by bit, starting from the left side of the equation:
- \(P(z_i = j)\) : The probability that token i is assigned to topic j.
- \(z_{-i}\) : Represents topic assignments of all other tokens.
- \(w_i\) : Word (index) of the \(i_{th}\) token.
- \(d_i\) : Document containing the \(i_{th}\) token.
For the right side of the equal sign:
- \(C^{WT}\) : Word-topic matrix, the wt matrix we generated.
- \(\sum^W_{ w = 1 }C^{WT}_{wj}\) : Total number of tokens (words) assigned to topic j.
- \(C^{DT}\) : Document-topic matrix, the dt matrix we generated.
- \(\sum^T_{ t = 1 }C^{DT}_{d_it}\) : Total number of tokens (words) in document \(d_i\).
- \(\eta\) : Parameter that sets the topic distribution for the words, the higher the more spread out the words will be across the specified number of topics (K).
- \(\alpha\) : Parameter that sets the topic distribution for the documents, the higher the more spread out the documents will be across the specified number of topics (K).
- \(W\) : Number of unique words (i.e., the vocabulary size) of the document collection.
- \(T\) : Number of topics, equivalent of the K we defined earlier.
# parameters
alpha <- 1
eta <- 1
# initial topics assigned to the first word of the first document
# and its corresponding word id
t0 <- ta[[1]][1]
wid <- docs[[1]][1]
# z_-i means that we do not include token w in our word-topic and document-topic count matrix when sampling for token w, only leave the topic assignments of all other tokens for document 1
dt[1, t0] <- dt[1, t0] - 1
wt[t0, wid] <- wt[t0, wid] - 1
# Calculate left side and right side of equal sign
left <- ( wt[, wid] + eta ) / ( rowSums(wt) + length(vocab) * eta )
right <- ( dt[1, ] + alpha ) / ( sum( dt[1, ] ) + K * alpha )
# draw new topic for the first word in the first document
# The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero.
t1 <- sample(1:K, 1, prob = left * right)
t1
## [1] 2
After this first draw, the topic for the first word in the first document is updated to 2. Just remember that after drawing the new topic, we need to update the topic assignment list with the newly sampled topic for token w and re-increment the word-topic and document-topic count matrices accordingly.
We will use Ethen Liu's user-written function LDA1 (in LDA_functions.R) as a demo to run some iterations. It takes the following parameters:
- docs : Documents that have been converted to token (word) ids.
- vocab : Unique tokens (words) across the document collection.
- K : Number of topic groups.
- alpha and eta : Distribution parameters, as explained earlier.
- iterations : Number of Gibbs sampling iterations used to train the model.
It returns a list containing the final word-topic count matrix wt and document-topic count matrix dt.
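Since LDA_functions.R is an external file, here is a minimal, self-contained sketch of what such a collapsed Gibbs sampler might look like. The function name and signature follow the description above, but this is an illustration under our own assumptions, not Ethen Liu's exact code:

```r
# A minimal collapsed Gibbs sampler for LDA (illustrative sketch).
# docs: list of integer vectors (word ids); vocab: character vector of unique words.
LDA1 <- function(docs, vocab, K, alpha, eta, iterations) {
  W  <- length(vocab)
  wt <- matrix(0, K, W, dimnames = list(NULL, vocab))  # word-topic counts
  dt <- matrix(0, length(docs), K)                     # document-topic counts
  ta <- lapply(docs, function(x) sample(1:K, length(x), replace = TRUE))
  for (d in seq_along(docs)) {                         # fill the initial counts
    for (w in seq_along(docs[[d]])) {
      wt[ta[[d]][w], docs[[d]][w]] <- wt[ta[[d]][w], docs[[d]][w]] + 1
      dt[d, ta[[d]][w]] <- dt[d, ta[[d]][w]] + 1
    }
  }
  for (i in 1:iterations) {
    for (d in seq_along(docs)) {
      for (w in seq_along(docs[[d]])) {
        t0  <- ta[[d]][w]
        wid <- docs[[d]][w]
        wt[t0, wid] <- wt[t0, wid] - 1                 # remove token w from the counts
        dt[d, t0]   <- dt[d, t0] - 1
        left  <- (wt[, wid] + eta) / (rowSums(wt) + W * eta)
        right <- (dt[d, ] + alpha) / (sum(dt[d, ]) + K * alpha)
        t1 <- sample(1:K, 1, prob = left * right)      # draw a new topic for token w
        ta[[d]][w]  <- t1                              # re-increment with the new topic
        wt[t1, wid] <- wt[t1, wid] + 1
        dt[d, t1]   <- dt[d, t1] + 1
      }
    }
  }
  list(wt = wt, dt = dt)
}
```

Note that the per-token decrement, sampling, and re-increment steps are exactly the cycle we walked through by hand for the first word of the first document.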
# define parameters
K <- 2
alpha <- 1
eta <- 0.001
iterations <- 1000
source("LDA_functions.R")
set.seed(2020)
lda1 <- LDA1( docs = docs, vocab = vocab,
K = K, alpha = alpha, eta = eta, iterations = iterations )
lda1
## $wt
## cuisine important part culture includes originating diverse regions china
## [1,] 0 0 0 0 0 0 1 0 0
## [2,] 2 1 1 1 1 2 0 2 2
## well overseas settled parts world preference seasoning cooking techniques
## [1,] 0 1 0 0 0 0 0 0 0
## [2,] 1 0 1 1 1 1 1 2 1
## provinces depend differences historical background ethnic groups society
## [1,] 1 0 0 0 0 0 0 0
## [2,] 0 1 2 1 1 1 1 1
## greatly valued gastronomy developed extensive study subject based
## [1,] 0 0 0 1 1 0 1 3
## [2,] 1 1 1 0 0 1 0 0
## traditional medical beliefs variety styles chefs classified eight regional
## [1,] 1 0 1 0 0 1 0 0 0
## [2,] 0 1 0 2 1 0 1 1 1
## cuisines according distinct tastes local characteristics raw materials
## [1,] 0 0 0 0 0 0 0 0
## [2,] 1 1 1 1 1 1 1 1
## ingredients used method preparation cultural foods different flavors
## [1,] 0 0 0 1 0 0 0 1
## [2,] 1 1 1 0 1 1 2 0
## textures prepared country referred united states canada also known
## [1,] 0 0 0 0 3 3 1 1 1
## [2,] 1 1 1 1 0 0 0 0 0
## gridironis team sport played two teams eleven players rectangular field
## [1,] 1 2 2 2 3 2 1 1 1 1
## [2,] 0 0 0 0 0 0 0 0 0 0
## goalposts end evolved sports soccer rugby first match november 6 1869
## [1,] 0 2 1 1 2 1 1 0 1 0 1
## [2,] 1 0 0 0 0 0 0 1 0 1 0
## college rutgers princeton using rules time popular forms game professional
## [1,] 3 1 1 0 0 1 2 1 3 2
## [2,] 0 0 0 1 2 0 0 0 0 0
## major levels high school youth winner scored points multiple ways score
## [1,] 1 0 1 0 1 1 1 0 1 1 0
## [2,] 0 1 0 1 0 0 0 1 0 0 1
## games last total 60 minutes play divided halves 30 four quarters 15
## [1,] 1 0 1 1 3 0 1 0 1 1 1 1
## [2,] 0 1 0 0 0 1 0 1 0 0 0 0
##
## $dt
## [,1] [,2]
## [1,] 2 13
## [2,] 1 10
## [3,] 6 6
## [4,] 1 13
## [5,] 3 15
## [6,] 16 2
## [7,] 18 6
## [8,] 12 2
## [9,] 8 2
## [10,] 14 3
After we’re done with learning the topics for 1000 iterations, we can use the count matrices to obtain the word-topic distribution and document-topic distribution.
To compute the probability of word given topic:
\[\beta_{ij} = \frac{C^{WT}_{ij} + \eta}{\sum^W_{ k = 1 }C^{WT}_{kj} + W\eta}\]
Where \(\beta_{ij}\) is the probability of word i for topic j.
# topic probability of every word
( beta <- ( lda1$wt + eta ) / ( rowSums(lda1$wt) + length(vocab) * eta ) )
## cuisine important part culture includes
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05
## [2,] 2.774465e-02 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02
## originating diverse regions china well
## [1,] 1.232711e-05 1.233944e-02 1.232711e-05 1.232711e-05 1.232711e-05
## [2,] 2.774465e-02 1.386539e-05 2.774465e-02 2.774465e-02 1.387926e-02
## overseas settled parts world preference
## [1,] 1.233944e-02 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05
## [2,] 1.386539e-05 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02
## seasoning cooking techniques provinces depend
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 1.233944e-02 1.232711e-05
## [2,] 1.387926e-02 2.774465e-02 1.387926e-02 1.386539e-05 1.387926e-02
## differences historical background ethnic groups
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05
## [2,] 2.774465e-02 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02
## society greatly valued gastronomy developed
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05 1.233944e-02
## [2,] 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02 1.386539e-05
## extensive study subject based traditional
## [1,] 1.233944e-02 1.232711e-05 1.233944e-02 3.699366e-02 1.233944e-02
## [2,] 1.386539e-05 1.387926e-02 1.386539e-05 1.386539e-05 1.386539e-05
## medical beliefs variety styles chefs
## [1,] 1.232711e-05 1.233944e-02 1.232711e-05 1.232711e-05 1.233944e-02
## [2,] 1.387926e-02 1.386539e-05 2.774465e-02 1.387926e-02 1.386539e-05
## classified eight regional cuisines according
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05
## [2,] 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02
## distinct tastes local characteristics raw
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05
## [2,] 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02
## materials ingredients used method preparation
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 1.232711e-05 1.233944e-02
## [2,] 1.387926e-02 1.387926e-02 1.387926e-02 1.387926e-02 1.386539e-05
## cultural foods different flavors textures
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 1.233944e-02 1.232711e-05
## [2,] 1.387926e-02 1.387926e-02 2.774465e-02 1.386539e-05 1.387926e-02
## prepared country referred united states
## [1,] 1.232711e-05 1.232711e-05 1.232711e-05 3.699366e-02 3.699366e-02
## [2,] 1.387926e-02 1.387926e-02 1.387926e-02 1.386539e-05 1.386539e-05
## canada also known gridironis team
## [1,] 1.233944e-02 1.233944e-02 1.233944e-02 1.233944e-02 2.466655e-02
## [2,] 1.386539e-05 1.386539e-05 1.386539e-05 1.386539e-05 1.386539e-05
## sport played two teams eleven
## [1,] 2.466655e-02 2.466655e-02 3.699366e-02 2.466655e-02 1.233944e-02
## [2,] 1.386539e-05 1.386539e-05 1.386539e-05 1.386539e-05 1.386539e-05
## players rectangular field goalposts end
## [1,] 1.233944e-02 1.233944e-02 1.233944e-02 1.232711e-05 2.466655e-02
## [2,] 1.386539e-05 1.386539e-05 1.386539e-05 1.387926e-02 1.386539e-05
## evolved sports soccer rugby first
## [1,] 1.233944e-02 1.233944e-02 2.466655e-02 1.233944e-02 1.233944e-02
## [2,] 1.386539e-05 1.386539e-05 1.386539e-05 1.386539e-05 1.386539e-05
## match november 6 1869 college
## [1,] 1.232711e-05 1.233944e-02 1.232711e-05 1.233944e-02 3.699366e-02
## [2,] 1.387926e-02 1.386539e-05 1.387926e-02 1.386539e-05 1.386539e-05
## rutgers princeton using rules time
## [1,] 1.233944e-02 1.233944e-02 1.232711e-05 1.232711e-05 1.233944e-02
## [2,] 1.386539e-05 1.386539e-05 1.387926e-02 2.774465e-02 1.386539e-05
## popular forms game professional major
## [1,] 2.466655e-02 1.233944e-02 3.699366e-02 2.466655e-02 1.233944e-02
## [2,] 1.386539e-05 1.386539e-05 1.386539e-05 1.386539e-05 1.386539e-05
## levels high school youth winner
## [1,] 1.232711e-05 1.233944e-02 1.232711e-05 1.233944e-02 1.233944e-02
## [2,] 1.387926e-02 1.386539e-05 1.387926e-02 1.386539e-05 1.386539e-05
## scored points multiple ways score
## [1,] 1.233944e-02 1.232711e-05 1.233944e-02 1.233944e-02 1.232711e-05
## [2,] 1.386539e-05 1.387926e-02 1.386539e-05 1.386539e-05 1.387926e-02
## games last total 60 minutes
## [1,] 1.233944e-02 1.232711e-05 1.233944e-02 1.233944e-02 3.699366e-02
## [2,] 1.386539e-05 1.387926e-02 1.386539e-05 1.386539e-05 1.386539e-05
## play divided halves 30 four
## [1,] 1.232711e-05 1.233944e-02 1.232711e-05 1.233944e-02 1.233944e-02
## [2,] 1.387926e-02 1.386539e-05 1.387926e-02 1.386539e-05 1.386539e-05
## quarters 15
## [1,] 1.233944e-02 1.233944e-02
## [2,] 1.386539e-05 1.386539e-05
\[\theta_{dj} = \frac{C^{DT}_{dj} + \alpha}{\sum^T_{ k = 1 }C^{DT}_{dk} + T\alpha}\]
Where \(\theta_{dj}\) is the proportion of topic j in document d.
# topic probability of every document
( theta <- ( lda1$dt + alpha ) / ( rowSums(lda1$dt) + K * alpha ) )
## [,1] [,2]
## [1,] 0.1764706 0.8235294
## [2,] 0.1538462 0.8461538
## [3,] 0.5000000 0.5000000
## [4,] 0.1250000 0.8750000
## [5,] 0.2000000 0.8000000
## [6,] 0.8500000 0.1500000
## [7,] 0.7307692 0.2692308
## [8,] 0.8125000 0.1875000
## [9,] 0.7500000 0.2500000
## [10,] 0.7894737 0.2105263
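As a sanity check, each row of theta is a probability distribution over topics, so it should sum to 1. Here is the same normalization on a toy document-topic count matrix (the counts are made up for illustration):

```r
# toy document-topic counts for 3 documents and 2 topics (hypothetical numbers)
dt_toy <- matrix(c(5, 10,
                   3,  8,
                   2, 10), nrow = 3, byrow = TRUE)
alpha <- 1
K <- 2
# same formula as above: add the prior alpha and normalize by row
theta_toy <- (dt_toy + alpha) / (rowSums(dt_toy) + K * alpha)
rowSums(theta_toy)
# [1] 1 1 1
```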
Recall that LDA assumes each document is a mixture of all topics. Thus, after computing the probability that each document belongs to each topic (and likewise for each word and topic), we can use this information to see which topic each document most belongs to and which words are most likely associated with each topic. For more details on Gibbs sampling, check Griffiths and Steyvers (2004), Finding Scientific Topics.
# topic assigned to each document, the one with the highest probability
topic <- apply(theta, 1, which.max)
# possible words under each topic
# sort the probability and obtain the user-specified number n
Terms <- function(beta, n) {
term <- matrix(0, n, K)
for( p in 1:nrow(beta) ) {
term[, p] <- names( sort( beta[p, ], decreasing = TRUE )[1:n] )
}
return(term)
}
term <- Terms(beta = beta, n = 2)
We specified that we want to see the top 2 terms associated with each topic. The following section prints out the original raw documents, grouped into the 2 topics we specified, along with the words most likely to go with each topic.
## $original_text
## [1] "society greatly valued gastronomy developed extensive study subject based traditional medical beliefs"
## [2] "referred united states canada also known gridironis team sport played two teams eleven players rectangular field goalposts end"
## [3] "evolved united states originating sports soccer rugby first match played november 6 1869 two college teams rutgers princeton using rules based rules soccer time"
## [4] "popular sport united states popular forms game professional college major levels high school youth"
## [5] "winner team scored points end game multiple ways score game"
## [6] "games last total 60 minutes professional college play divided two halves 30 minutes four quarters 15 minutes"
##
## $words
## [1] "based" "united"
## $original_text
## [1] "cuisine important part culture includes cuisine originating diverse regions china well overseas settled parts world"
## [2] "preference seasoning cooking techniques provinces depend differences historical background ethnic groups"
## [3] "variety styles cooking china chefs classified eight regional cuisines according distinct tastes local characteristics"
## [4] "based raw materials ingredients used method preparation cultural differences variety foods different flavors textures prepared different regions country"
##
## $words
## [1] "cuisine" "originating"
The output tells us that the first topic seems to discuss something about the United States, while the second is about food. It is still messy and not that intuitive, but at least it is a good starting point.
Now let us move on to using the R library topicmodels to fit an LDA.
Since the starting point of Gibbs sampling is chosen randomly, it makes sense to discard the first few iterations (also known as the burn-in period), because they most likely do not correctly reflect the properties of the distribution. Another parameter is thin, the number of iterations omitted between saved samples during training; this serves to reduce correlations between successive samples.
We'll use the LDA function from the topicmodels library to run the Gibbs sampling method on the same set of raw documents and print out the result for you to compare. Note that the library defaults to 50 / K for \(\alpha\) and 0.1 for \(\eta\).
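Burn-in and thinning are easy to picture on a plain vector of draws: discard the first burnin draws, then keep only every thin-th draw of the remainder. This is a generic sketch of the idea, not topicmodels internals:

```r
# 4000 hypothetical Gibbs draws of some quantity
draws  <- seq_len(4000)
burnin <- 500
thin   <- 100
# drop the burn-in period, then keep every 100th draw afterwards
kept <- draws[seq(burnin + thin, length(draws), by = thin)]
head(kept, 3)
# [1] 600 700 800
length(kept)
# [1] 35
```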
# load packages if not installed, using install.packages("topicmodels")
library(tm)
library(topicmodels)
# @burnin : number of omitted Gibbs iterations at beginning
# @thin : number of omitted in-between Gibbs iterations
docs1 <- Corpus( VectorSource(raw_docs) )
dtm <- DocumentTermMatrix(docs1)
# josh'cc- the input of LDA is a document-term matrix. You can use tm::DocumentTermMatrix to create it. Note you can also use tidytext package to do this. We will talk about it later when we fit a stm.
lda <- LDA( dtm, k = 2, method = "Gibbs",
control = list(seed = 2020, burnin = 500, thin = 100, iter = 4000) )
list( original_text = raw_docs[ topics(lda) == 1 ], words = terms(lda, 3)[, 1] )
## $original_text
## [1] "cuisine important part culture includes cuisine originating diverse regions china well overseas settled parts world"
## [2] "preference seasoning cooking techniques provinces depend differences historical background ethnic groups"
## [3] "society greatly valued gastronomy developed extensive study subject based traditional medical beliefs"
## [4] "variety styles cooking china chefs classified eight regional cuisines according distinct tastes local characteristics"
## [5] "based raw materials ingredients used method preparation cultural differences variety foods different flavors textures prepared different regions country"
## [6] "games last total 60 minutes professional college play divided two halves 30 minutes four quarters 15 minutes"
##
## $words
## [1] "minutes" "china" "cuisine"
## $original_text
## [1] "referred united states canada also known gridironis team sport played two teams eleven players rectangular field goalposts end"
## [2] "evolved united states originating sports soccer rugby first match played november 6 1869 two college teams rutgers princeton using rules based rules soccer time"
## [3] "popular sport united states popular forms game professional college major levels high school youth"
## [4] "winner team scored points end game multiple ways score game"
##
## $words
## [1] "based" "states" "two"
Notice that after training the model for 4000 iterations with different \(\alpha\) and \(\eta\) values, we obtained a different document clustering and different words associated with each topic. Since the goal here is to perform a clustering (unsupervised) method to unveil unknown patterns, the solutions will most likely differ, as there is no single correct answer. We should try a range of different values of K to find the optimal topic grouping of the documents and see which result matches our intuition best.
Structural Topic Model
In this part we rely heavily on stm's tutorial by Molly Roberts, Brandon Stewart and Dustin Tingley, and an application by Julia Silge (https://juliasilge.com/blog/sherlock-holmes-stm/). We will go through their tutorial and show you how to fit an STM with the R library stm.
Let us install stm first.
#library(devtools)
#install_github("bstewart/stm",dependencies=TRUE)
library(stm)
library(tidyverse)
We use the data from the stm tutorial; it covers American political blogs from 2008 (see http://reports-archive.adm.cs.cmu.edu/anon/ml2010/CMU-ML-10-101.pdf).
## Warning: Missing column names filled in: 'X1' [1]
Before we run topic models like LDA, we need to preprocess the data. stm provides several functions that automatically do stemming, stop-word removal, low-frequency word removal, etc., for you.
Here is the graph of stm processors:
STM process
Let us use the textProcessor to preprocess texts. Here is the function:
textProcessor(documents, metadata = NULL, lowercase = TRUE, removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE, ucp = FALSE, stem = TRUE, wordLengths = c(3, Inf), sparselevel = 1, language = "en", verbose = TRUE, onlycharacter = FALSE, striphtml = FALSE, customstopwords = NULL, custompunctuation = NULL, v1 = FALSE)
#Preprocessing
#stemming/stopword removal, etc.
#Josh-cc, if you don't know the details of a function, you can use ? to check the documentation of that function. ?textProcessor
processed <- textProcessor(data$documents, metadata=data)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
Let us use prepDocuments to perform several corpus manipulations, including removing words and renumbering word indices. Here is the function:
prepDocuments(documents, vocab, meta = NULL, lower.thresh = 1, upper.thresh = Inf, subsample = NULL, verbose = TRUE)
#before running prepDocuments, you can use the plotRemoved function to check the appropriate threshold for removing words or documents
#take a look at how many words and documents would be removed with different lower.thresh values
plotRemoved(processed$documents, lower.thresh=seq(1,200, by=100))
#structure and index for usage in the stm model. Verify no missingness. You can remove low-frequency words using the 'lower.thresh' option. See ?prepDocuments for more info
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh=1)
## Removing 83198 of 123990 terms (83198 of 2298953 tokens) due to frequency
## Your corpus now has 13246 documents, 40792 terms and 2215755 tokens.
#output will have object meta, documents, and vocab
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
Now, let us use the stm function to fit an STM model.
The function takes sparse representation of a document-term matrix, an integer number of topics, and covariates and returns fitted model parameters. Covariates can be used in the prior for topic prevalence, in the prior for topical content or both.
stm(documents, vocab, K, prevalence = NULL, content = NULL, data = NULL, init.type = c(“Spectral”, “LDA”, “Random”, “Custom”), seed = NULL, max.em.its = 500, emtol = 1e-05, verbose = TRUE, reportevery = 5, LDAbeta = TRUE, interactions = TRUE, ngroups = 1, model = NULL, gamma.prior = c(“Pooled”, “L1”), sigma.prior = 0, kappa.prior = c(“L1”, “Jeffreys”), control = list())
#run an stm model using the 'out' data with 20 topics, asking how prevalence of topics varies across documents' metadata, including 'rating' and 'day'. The option s(day) applies a spline transformation to the day variable
# max.em.its should be at least 100; we use 15 just as a demo
poliblogPrevFit <- stm(out$documents,out$vocab,K=20,prevalence =~ rating+ s(day), max.em.its=15, data=out$meta,seed=2020)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Using only 10000 most frequent terms during initialization...
## Finding anchor words...
## ....................
## Recovering initialization...
## ....................................................................................................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -8.203)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.813, relative change = 4.757e-02)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.775, relative change = 4.854e-03)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.759, relative change = 2.003e-03)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.750, relative change = 1.175e-03)
## Topic 1: bill, senat, legisl, vote, hous
## Topic 2: citi, mayor, marin, chicago, pentagon
## Topic 3: obama, mccain, campaign, hillari, will
## Topic 4: iran, obama, israel, nuclear, will
## Topic 5: will, year, new, govern, compani
## Topic 6: like, one, media, say, news
## Topic 7: senat, obama, will, lieberman, democrat
## Topic 8: court, vote, will, state, elect
## Topic 9: mccain, campaign, palin, john, said
## Topic 10: report, hous, said, investig, white
## Topic 11: tortur, think, know, ’re, peopl
## Topic 12: obama, peopl, polit, american, one
## Topic 13: iraq, war, iraqi, troop, militari
## Topic 14: get, will, ’ll, can, one
## Topic 15: bush, presid, will, administr, said
## Topic 16: democrat, clinton, parti, republican, will
## Topic 17: mccain, tax, will, american, economi
## Topic 18: said, attack, polic, kill, terrorist
## Topic 19: one, film, will, time, like
## Topic 20: state, nation, unit, will, american
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.744, relative change = 8.060e-04)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.739, relative change = 6.215e-04)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.735, relative change = 4.992e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.732, relative change = 4.126e-04)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.729, relative change = 3.508e-04)
## Topic 1: bill, democrat, senat, republican, vote
## Topic 2: citi, immigr, chicago, mayor, illeg
## Topic 3: obama, mccain, campaign, hillari, poll
## Topic 4: iran, obama, israel, nuclear, polici
## Topic 5: will, oil, price, year, compani
## Topic 6: media, like, news, one, say
## Topic 7: obama, senat, will, lieberman, democrat
## Topic 8: court, vote, state, will, law
## Topic 9: mccain, palin, campaign, john, said
## Topic 10: report, investig, hous, depart, said
## Topic 11: tortur, think, know, peopl, say
## Topic 12: obama, peopl, polit, american, one
## Topic 13: iraq, war, iraqi, troop, militari
## Topic 14: get, will, ’ll, doesn’t, one
## Topic 15: bush, presid, administr, said, will
## Topic 16: democrat, republican, parti, clinton, will
## Topic 17: tax, will, american, economi, mccain
## Topic 18: said, attack, polic, kill, terrorist
## Topic 19: film, one, time, will, like
## Topic 20: nation, will, american, world, secur
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.727, relative change = 3.094e-04)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.725, relative change = 2.776e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.723, relative change = 2.518e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.721, relative change = 2.292e-04)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
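The run above hit max.em.its before converging. As a sanity check, you can plot the approximate bound across EM iterations to see whether the fit was still improving; this sketch assumes the fitted object stores the bound trajectory in its convergence slot, as stm model objects do:

```r
# The fitted stm object records the approximate variational bound at each
# EM iteration; a still-rising curve means more iterations would help.
plot(poliblogPrevFit$convergence$bound, type = "l",
     xlab = "EM iteration", ylab = "Approximate per-word bound")
```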
Like LDA, stm also requires you to specify the number of topics or themes (K) before fitting. Fortunately, stm provides a function, selectModel, to help you select among models with high likelihood values.
selectModel(documents, vocab, K, prevalence = NULL, content = NULL, data = NULL, max.em.its = 100, verbose = TRUE, init.type = "LDA", emtol = 1e-05, seed = NULL, runs = 50, frexw = 0.7, net.max.em.its = 2, netverbose = FALSE, M = 10, N = NULL, to.disk = F, ...)
#let selectModel compare a number of models side by side; it first casts a net of short runs and keeps only the ones with the highest likelihoods
poliblogSelect <- selectModel(out$documents, out$vocab, K=20, prevalence =~ rating + s(day), max.em.its=15, data=meta, runs=20, seed=2020)
## Casting net
## 1 models in net
## 2 models in net
## 3 models in net
## 4 models in net
## 5 models in net
## 6 models in net
## 7 models in net
## 8 models in net
## 9 models in net
## 10 models in net
## 11 models in net
## 12 models in net
## 13 models in net
## 14 models in net
## 15 models in net
## 16 models in net
## 17 models in net
## 18 models in net
## 19 models in net
## 20 models in net
## Running select models
## 1 select model run
## Beginning LDA Initialization
## ....................................................................................................
## Completed E-Step (9 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.871)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.863, relative change = 1.057e-03)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.858, relative change = 6.854e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.852, relative change = 6.736e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.847, relative change = 7.057e-04)
## Topic 1: mccain, obama, campaign, john, mccain’
## Topic 2: attack, iran, terrorist, state, israel
## Topic 3: global, one, year, chang, fact
## Topic 4: will, world, one, peopl, believ
## Topic 5: obama, clinton, hillari, barack, campaign
## Topic 6: think, like, say, dont, know
## Topic 7: palin, obama, governor, senat, chicago
## Topic 8: will, get, last, day, romney
## Topic 9: iraq, war, militari, iraqi, forc
## Topic 10: tax, money, oil, govern, compani
## Topic 11: will, american, need, work, can
## Topic 12: one, like, stori, time, love
## Topic 13: year, school, citi, two, univers
## Topic 14: conserv, right, polit, issu, liber
## Topic 15: think, want, question, ’re, say
## Topic 16: said, report, hous, sen, bush
## Topic 17: bush, administr, presid, use, tortur
## Topic 18: law, state, court, rule, will
## Topic 19: time, report, new, york, stori
## Topic 20: democrat, vote, republican, poll, state
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.841, relative change = 7.467e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.835, relative change = 7.832e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.829, relative change = 7.978e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.822, relative change = 7.809e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.817, relative change = 7.343e-04)
## Topic 1: mccain, obama, campaign, john, mccain’
## Topic 2: iran, attack, state, terrorist, israel
## Topic 3: global, year, chang, one, fact
## Topic 4: will, world, one, peopl, russia
## Topic 5: obama, clinton, hillari, barack, campaign
## Topic 6: think, like, say, dont, peopl
## Topic 7: palin, governor, chicago, obama, senat
## Topic 8: will, last, get, day, romney
## Topic 9: iraq, war, militari, iraqi, troop
## Topic 10: tax, money, oil, govern, compani
## Topic 11: will, american, need, work, can
## Topic 12: one, stori, like, love, life
## Topic 13: school, year, citi, univers, two
## Topic 14: conserv, right, polit, issu, liber
## Topic 15: think, want, question, say, get
## Topic 16: said, report, hous, sen, bush
## Topic 17: bush, administr, presid, tortur, use
## Topic 18: law, court, state, rule, will
## Topic 19: time, report, new, york, stori
## Topic 20: democrat, vote, republican, poll, state
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.811, relative change = 6.951e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.806, relative change = 6.726e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.801, relative change = 6.430e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.796, relative change = 6.101e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
## 2 select model run
## Beginning LDA Initialization
## ....................................................................................................
## Completed E-Step (10 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.873)
## ....................................................................................................
## Completed E-Step (9 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.865, relative change = 1.024e-03)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.860, relative change = 6.820e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.855, relative change = 6.681e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.849, relative change = 6.892e-04)
## Topic 1: tax, will, govern, econom, money
## Topic 2: obama, campaign, mccain, clinton, barack
## Topic 3: say, like, think, one, know
## Topic 4: iraq, war, militari, forc, iraqi
## Topic 5: bush, administr, law, court, presid
## Topic 6: peopl, know, think, dont, say
## Topic 7: palin, year, biden, sarah, time
## Topic 8: senat, democrat, bill, republican, hous
## Topic 9: call, group, video, organ, immigr
## Topic 10: obama, polit, chicago, investig, charg
## Topic 11: voter, vote, state, poll, elect
## Topic 12: report, new, work, offic, said
## Topic 13: oil, will, energi, price, health
## Topic 14: iran, terrorist, nation, state, attack
## Topic 15: one, women, time, day, two
## Topic 16: peopl, black, american, life, will
## Topic 17: get, like, obama’, make, want
## Topic 18: will, polit, republican, democrat, make
## Topic 19: mccain, said, john, bush, sen
## Topic 20: media, news, time, stori, report
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.844, relative change = 7.353e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.837, relative change = 7.811e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.831, relative change = 7.984e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.825, relative change = 7.678e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.819, relative change = 7.173e-04)
## Topic 1: tax, will, govern, econom, money
## Topic 2: obama, campaign, mccain, barack, clinton
## Topic 3: say, like, think, know, one
## Topic 4: iraq, war, militari, iraqi, troop
## Topic 5: bush, law, administr, court, hous
## Topic 6: peopl, know, think, say, dont
## Topic 7: palin, biden, year, sarah, governor
## Topic 8: senat, democrat, bill, hous, vote
## Topic 9: group, call, video, organ, immigr
## Topic 10: polit, investig, chicago, obama, charg
## Topic 11: voter, vote, state, poll, elect
## Topic 12: report, offic, new, said, work
## Topic 13: oil, energi, will, price, health
## Topic 14: iran, terrorist, state, attack, nation
## Topic 15: women, one, time, day, two
## Topic 16: black, peopl, american, life, will
## Topic 17: get, want, obama’, like, make
## Topic 18: will, polit, republican, democrat, conserv
## Topic 19: mccain, said, john, bush, sen
## Topic 20: media, news, time, stori, new
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.814, relative change = 6.744e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.809, relative change = 6.376e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.805, relative change = 6.039e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.800, relative change = 5.667e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
## 3 select model run
## Beginning LDA Initialization
## ....................................................................................................
## Completed E-Step (10 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.874)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.866, relative change = 1.027e-03)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.861, relative change = 6.777e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.855, relative change = 6.635e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.850, relative change = 6.829e-04)
## Topic 1: media, time, post, press, articl
## Topic 2: obama, campaign, barack, will, obama’
## Topic 3: will, one, say, peopl, stori
## Topic 4: report, offici, investig, depart, former
## Topic 5: congress, bill, will, hous, legisl
## Topic 6: republican, democrat, senat, vote, gop
## Topic 7: mccain, campaign, obama, john, palin
## Topic 8: hillari, clinton, poll, democrat, voter
## Topic 9: get, want, even, make, pay
## Topic 10: will, american, can, countri, america
## Topic 11: think, like, say, peopl, know
## Topic 12: tax, oil, econom, govern, million
## Topic 13: show, like, get, ’re, just
## Topic 14: iraq, war, militari, iran, iraqi
## Topic 15: school, one, stori, immigr, eastern
## Topic 16: bush, said, presid, say, sen
## Topic 17: black, women, live, liber, one
## Topic 18: state, court, law, elect, right
## Topic 19: polit, polici, nation, power, support
## Topic 20: world, nation, will, govern, use
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.844, relative change = 7.110e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.839, relative change = 7.317e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.833, relative change = 7.253e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.827, relative change = 7.121e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.822, relative change = 7.025e-04)
## Topic 1: media, time, post, press, read
## Topic 2: obama, campaign, barack, will, obama’
## Topic 3: will, one, say, peopl, kill
## Topic 4: report, offici, investig, depart, former
## Topic 5: congress, bill, will, hous, legisl
## Topic 6: republican, senat, democrat, vote, gop
## Topic 7: mccain, john, campaign, palin, obama
## Topic 8: hillari, clinton, poll, democrat, voter
## Topic 9: get, want, even, make, pay
## Topic 10: will, american, can, countri, america
## Topic 11: think, like, say, peopl, know
## Topic 12: tax, oil, econom, govern, million
## Topic 13: show, ’re, like, get, ’ll
## Topic 14: iraq, war, militari, iran, iraqi
## Topic 15: school, stori, immigr, one, eastern
## Topic 16: bush, said, presid, say, sen
## Topic 17: black, women, live, liber, one
## Topic 18: state, court, law, elect, rule
## Topic 19: polit, polici, nation, support, power
## Topic 20: world, nation, govern, will, forc
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.816, relative change = 6.945e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.811, relative change = 6.646e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.806, relative change = 6.189e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.802, relative change = 5.785e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
## 4 select model run
## Beginning LDA Initialization
## ....................................................................................................
## Completed E-Step (10 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.875)
## ....................................................................................................
## Completed E-Step (9 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.867, relative change = 1.005e-03)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.861, relative change = 6.878e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.856, relative change = 6.855e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.850, relative change = 7.140e-04)
## Topic 1: obama, barack, campaign, obama’, polit
## Topic 2: mccain, john, said, campaign, report
## Topic 3: law, right, court, state, use
## Topic 4: elect, democrat, vote, parti, republican
## Topic 5: iraq, war, militari, iraqi, forc
## Topic 6: think, say, like, said, dont
## Topic 7: oil, will, energi, price, compani
## Topic 8: senat, democrat, bill, republican, hous
## Topic 9: ’re, don’t, get, think, doesn’t
## Topic 10: palin, women, run, one, experi
## Topic 11: obama, clinton, hillari, poll, mccain
## Topic 12: global, one, year, articl, new
## Topic 13: iran, bush, foreign, presid, polici
## Topic 14: tax, econom, economi, govern, plan
## Topic 15: one, two, armi, kill, polic
## Topic 16: will, american, peopl, countri, america
## Topic 17: will, debat, updat, deleg, romney
## Topic 18: bush, report, hous, administr, presid
## Topic 19: media, time, stori, news, liber
## Topic 20: polit, school, year, public, work
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.844, relative change = 7.736e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.838, relative change = 8.280e-04)
## ....................................................................................................
## Completed E-Step (8 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.831, relative change = 8.120e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.825, relative change = 7.917e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.819, relative change = 7.528e-04)
## Topic 1: obama, barack, campaign, obama’, polit
## Topic 2: mccain, john, said, campaign, report
## Topic 3: law, court, right, state, use
## Topic 4: elect, democrat, vote, parti, republican
## Topic 5: iraq, war, militari, iraqi, troop
## Topic 6: think, say, like, said, know
## Topic 7: oil, energi, will, price, compani
## Topic 8: senat, bill, democrat, republican, hous
## Topic 9: ’re, don’t, get, think, question
## Topic 10: palin, women, run, experi, sarah
## Topic 11: obama, clinton, hillari, poll, mccain
## Topic 12: global, year, articl, one, new
## Topic 13: iran, foreign, bush, israel, polici
## Topic 14: tax, econom, govern, economi, plan
## Topic 15: one, two, kill, armi, polic
## Topic 16: will, american, peopl, countri, america
## Topic 17: will, debat, updat, deleg, romney
## Topic 18: bush, report, hous, administr, said
## Topic 19: media, time, stori, news, use
## Topic 20: polit, school, year, public, work
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.814, relative change = 7.046e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.809, relative change = 6.571e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.804, relative change = 6.138e-04)
## ....................................................................................................
## Completed E-Step (7 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.799, relative change = 5.758e-04)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
#plot the different models that make the cut along the exclusivity and semantic coherence of their topics
plotModels(poliblogSelect)
#the 3rd one looks best, so choose it and give it the name poliblogPrevFit
poliblogPrevFit <- poliblogSelect$runout[[3]] #choose the third model
Now it is time to interpret the stm model.
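The word lists below come from labelTopics, which reports Highest Probability, FREX, Lift, and Score words for each topic. A call along these lines produced them (the topic indices 1, 7, and 10 are inferred from the printout):

```r
# Label topics 1, 7, and 10 with their top words under four weighting schemes
labelTopics(poliblogPrevFit, c(1, 7, 10))
```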
## Topic 1 Top Words:
## Highest Prob: media, time, post, read, press, news, articl
## FREX: blagojevich, publish, abort, blogger, editor, mainstream, magazin
## Lift: bienenstock, apss, tertio, okubo, ‘emiss, bienenstockphys, decidendi
## Score: media, blagojevich, post, book, read, articl, press
## Topic 7 Top Words:
## Highest Prob: mccain, john, palin, campaign, attack, debat, biden
## FREX: palin, biden, sarah, ayer, couric, surrog, wurzelbach
## Lift: wpxi, sleaziest, rovito, verrilli, kmox, homina, trooper-g
## Score: mccain, palin, campaign, john, biden, mccain’, sarah
## Topic 10 Top Words:
## Highest Prob: will, american, can, countri, america, peopl, need
## FREX: care, job, america, health, togeth, afford, poverti
## Lift: --hour, filtrat, ninety-f, -compet, voteforchangecom, --young, -teach
## Score: will, american, america, countri, care, health, presid
Let us draw a word cloud, though I do not suggest doing this in your published research.
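A minimal sketch using stm's cloud function (topic 1 is chosen arbitrarily here for illustration):

```r
# Word cloud of topic 1's most probable words; max.words caps the display size
cloud(poliblogPrevFit, topic = 1, max.words = 50)
```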
Let us find some texts that are most representative for a particular topic using findThoughts function:
findThoughts outputs the most representative documents for a particular topic. Use it to get a better sense of the content of actual documents with high topical content.
findThoughts(model, texts = NULL, topics = NULL, n = 3, thresh = NULL, where = NULL, meta = NULL)
#object 'thoughts1' contains 2 documents about topic 1. 'texts=data$shortdoc' gives you just the first 250 characters
data <- data %>%
mutate(shortdoc=str_extract(documents,"^.{250}"))
thoughts1 <- findThoughts(poliblogPrevFit,
texts=data$shortdoc,
n=2,
topics=1)$docs[[1]]
#will show you the output
plotQuote(thoughts1, width=40, main="Topic 1")
Let us find more documents for more topics.
#how about more documents for more of these topics?
thoughts7 <- findThoughts(poliblogPrevFit,
texts=data$shortdoc,
n=2,
topics=7)$docs[[1]]
thoughts10 <- findThoughts(poliblogPrevFit,
texts=data$shortdoc,
n=2,
topics=10)$docs[[1]]
thoughts4 <- findThoughts(poliblogPrevFit,
texts=data$shortdoc,
n=2,
topics=4)$docs[[1]]
#And in a 2X2 table? We like 2X2 tables! --- Note: this command will force all remaining plots into a 2X2 table format
par(mfrow = c(2, 2),mar=c(.5,.5,1,.5))
plotQuote(thoughts1, width=40, main="Topic 1")
plotQuote(thoughts4, width=40, main="Topic 4")
plotQuote(thoughts7, width=40, main="Topic 7")
plotQuote(thoughts10, width=40, main="Topic 10")
Let us see the PROPORTION OF EACH TOPIC in the entire CORPUS.
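The corpus-level proportions can be shown with the summary plot of the fitted model; a sketch using stm's plot method for stm objects:

```r
# reset the plotting window after the 2x2 layout, then show expected
# topic proportions across the whole corpus
par(mfrow = c(1, 1))
plot(poliblogPrevFit, type = "summary", xlim = c(0, 0.3))
```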
Let us see how topics are correlated…
##see GRAPHICAL NETWORK DISPLAY of how closely related topics are to one another (i.e., how likely they are to appear in the same document). Requires 'igraph' package
mod.out.corr<-topicCorr(poliblogPrevFit)
plot.topicCorr(mod.out.corr)Let use see topical content by covariates
##VISUALIZE DIFFERENCES BETWEEN TWO DIFFERENT TOPICS using the ,type="perspectives" option
plot.STM(poliblogPrevFit,type="perspectives", topics=c(9, 10))
Let us see how the prevalence of topics varies across documents based on document covariates.
###See CORRELATIONS BTWN METADATA & TOPIC PREVALENCE in documents
###First, must estimate an effect of the metadata covariates on topic prevalence in a document, so that we have anything to plot
###Estimating the expected proportion of a document that belongs to a topic as a function of a covariate
#Right-quick, for this vignette, we need to tell R to convert the 'rating' variable in the meta file (which is currently a string variable) into a categorical variable
meta$rating<-as.factor(meta$rating)
#since we're preparing these covariates by estimating their effects, we call these estimated effects 'prep'
#we're estimating Effects across all 20 topics, 1:20. We're using 'rating' and normalized 'day,' using the topic model poliblogPrevFit.
#The meta data file we call meta. We are telling it to generate the model while accounting for all possible uncertainty. Note: when estimating effects of one covariate, others are held at their mean
prep <- estimateEffect(1:20 ~ rating+s(day),poliblogPrevFit,meta=meta, uncertainty = "Global")
###See how PREVALENCE of TOPICS DIFFERS across VALUES of a CATEGORICAL COVARIATE
plot.estimateEffect(prep, covariate = "rating", topics = c(1, 7, 10),
#topic model=poliblogPrevFit. Method="difference"
model=poliblogPrevFit, method="difference",
#only using two values of the covariate, and labeling them; we could do the same with a non-binary covariate by specifying any two of its values
cov.value1="Liberal",cov.value2="Conservative",
xlab="More Conservative ... More Liberal",
main="Effect of Liberal vs. Conservative",
xlim=c(-.1,.1), labeltype = "custom",
custom.labels = c('Jeremiah Wright', 'Sarah Palin',
'Bush Presidency'))
#See how PREVALENCE of TOPICS DIFFERS across VALUES of a CONTINUOUS COVARIATE
#plotting prep data on the 'day' variable, a continuous variable, with a continuous plot, focusing on topic 7
plot.estimateEffect(prep, "day", method="continuous", topics=7, model=poliblogPrevFit,
printlegend=FALSE, xaxt="n", xlab="Time (2008)")
monthseq <- seq(from=as.Date("2008-01-01"),
to=as.Date("2008-12-01"), by="month")
monthnames <- months(monthseq)
axis(1,at=as.numeric(monthseq)-min(as.numeric(monthseq)),
labels=monthnames)
Let us see how the words of the topics are emphasized differently across documents according to document covariates.
#### Instead of looking at how prevalent a topic is in a class of documents categorized by meta-data covariate...
#### ... let's see how the words of the topic are emphasized differently in documents of each category of the covariate
##First, we estimate a new stm. It's the same as the old one, including the prevalence option, but we add in a content option
poliblogContent <- stm(out$documents,out$vocab,K=20,
prevalence =~ rating+ s(day), content=~rating,
max.em.its=15, data=out$meta,seed=2020)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Using only 10000 most frequent terms during initialization...
## Finding anchor words...
## ....................
## Recovering initialization...
## ....................................................................................................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (7 seconds).
## ....................................................................................................
## Completed M-Step (93 seconds).
## Completing Iteration 1 (approx. per word bound = -8.203)
## ....................................................................................................
## Completed E-Step (6 seconds).
## ....................................................................................................
## Completed M-Step (94 seconds).
## Completing Iteration 2 (approx. per word bound = -7.804, relative change = 4.867e-02)
## ....................................................................................................
## Completed E-Step (6 seconds).
## ....................................................................................................
## Completed M-Step (79 seconds).
## Completing Iteration 3 (approx. per word bound = -7.756, relative change = 6.100e-03)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (78 seconds).
## Completing Iteration 4 (approx. per word bound = -7.737, relative change = 2.479e-03)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (78 seconds).
## Completing Iteration 5 (approx. per word bound = -7.726, relative change = 1.418e-03)
## Topic 1: nv-, ramstad, senate-pass, shadegg, sc-
## Topic 2: hahn, mahoney, councilmen, murtha, equestrian
## Topic 3: seven-point, overperform, unenthusiast, arg, wta
## Topic 4: npt, highly-enrich, obeidi, rozen, archivesth
## Topic 5: barclay, indebted, greenpeac, leed, guzzler
## Topic 6: unpretenti, saltsman’, blu-ray, ‘barack, uncool
## Topic 7: inouy, iyad, burri, baath, unacceptable”
## Topic 8: scotusblog, weinberg, wrong-doer, verse”, souter
## Topic 9: rogue”, iseman’, deconcini, bachtel, gration
## Topic 10: lurita, bloch’, osno, bloch, gsa
## Topic 11: cra, bottom”, paulson, mukasey’, fail”
## Topic 12: gitlin, illiber, heliocentr, bicentenni, meritocraci
## Topic 13: armour, maliki, nouri, anbar, shi’
## Topic 14: spca, slunk, gust, light-heart, chambliss’
## Topic 15: yediot, slovenia, poznan, berlusconi, trans-atlant
## Topic 16: seiu, veneman, rbc, efca, korg
## Topic 17: cuts’, “strengthen, “free-market, ketchikan, paygo
## Topic 18: tharin, cpr, ndjamena, outboard, hydrazin
## Topic 19: luttwak, andromeda, dawkin, elah, naturalist
## Topic 20: khalilzad, zalmay, davo, switzerland, condoleeza
## Aspect 1: fil, sown, hosanna, redound, jeremiad
## Aspect 2: melber, ddaydigbi, ddayif, prescienc, teeve
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (78 seconds).
## Completing Iteration 6 (approx. per word bound = -7.719, relative change = 8.904e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (77 seconds).
## Completing Iteration 7 (approx. per word bound = -7.714, relative change = 5.988e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (76 seconds).
## Completing Iteration 8 (approx. per word bound = -7.711, relative change = 4.164e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (77 seconds).
## Completing Iteration 9 (approx. per word bound = -7.709, relative change = 3.016e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (77 seconds).
## Completing Iteration 10 (approx. per word bound = -7.707, relative change = 2.278e-04)
## Topic 1: ramstad, boehner, nv-, r-calif, bailout…
## Topic 2: councilmen, mahoney, susquehanna, d-nc, murtha
## Topic 3: overperform, arg, wta, selzer, winner-take-
## Topic 4: npt, obeidi, rozen, highly-enrich, warhead
## Topic 5: counterparti, barclay, “price, greenpeac, cis
## Topic 6: shoprit, saltsman’, uncool, one-dimension, ‘barack
## Topic 7: inouy, burri, iyad, unacceptable”, -conn
## Topic 8: scotusblog, wrong-doer, verse”, souter, weinberg
## Topic 9: deconcini, salter, hazelbak, iseman’, politifact
## Topic 10: osc, osno, branchflow, dannehi, gsa
## Topic 11: cra, dobbs’, secretary’, fail”, policy’
## Topic 12: peopleand, remember, gitlin, bicentenni, multiculturalist
## Topic 13: maliki, khost, ungovern, hiltermann, guardianfilm
## Topic 14: spca, chambliss, saxbi, chambliss’, plush
## Topic 15: yediot, slovenia, zaidi’, berlusconi, subservi
## Topic 16: korg, germond, ceja, rospar, culinari
## Topic 17: cuts’, “strengthen, gasoline”, “free-market, overregul
## Topic 18: ivin, cairn, crimes”, anthrax, respiratori
## Topic 19: unwatch, elah, dawkin, helter, skelter
## Topic 20: khalilzad, zalmay, switzerland, davo, gareth
## Aspect 1: livid, hosanna, jeremiad, redound, --year-old
## Aspect 2: ddayif, prescienc, melber, themdday, ddaydigbi
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (77 seconds).
## Completing Iteration 11 (approx. per word bound = -7.706, relative change = 1.748e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (76 seconds).
## Completing Iteration 12 (approx. per word bound = -7.705, relative change = 1.348e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (76 seconds).
## Completing Iteration 13 (approx. per word bound = -7.704, relative change = 1.114e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (77 seconds).
## Completing Iteration 14 (approx. per word bound = -7.703, relative change = 9.317e-05)
## ....................................................................................................
## Completed E-Step (5 seconds).
## ....................................................................................................
## Completed M-Step (75 seconds).
## Model Terminated Before Convergence Reached
##Next, we plot using the ,type="perspectives" option to the plot.STM function
plot.STM(poliblogContent,type="perspectives", topics=10)
###Interacting covariates. Maybe we have a hypothesis that the prevalence of some topics trends differently over time for liberal versus conservative blogs.
##first, we estimate an STM with the interaction
poliblogInteraction <- stm(out$documents,out$vocab,K=20,
prevalence =~ rating* day, max.em.its=15,
data=out$meta,seed=2020)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Using only 10000 most frequent terms during initialization...
## Finding anchor words...
## ....................
## Recovering initialization...
## ....................................................................................................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -8.203)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.813, relative change = 4.754e-02)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.775, relative change = 4.823e-03)
## ....................................................................................................
## Completed E-Step (6 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.760, relative change = 1.983e-03)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.751, relative change = 1.160e-03)
## Topic 1: bill, senat, legisl, vote, hous
## Topic 2: citi, mayor, marin, chicago, pentagon
## Topic 3: obama, mccain, campaign, hillari, will
## Topic 4: iran, obama, israel, nuclear, will
## Topic 5: will, year, new, govern, compani
## Topic 6: like, one, media, say, news
## Topic 7: senat, obama, will, lieberman, democrat
## Topic 8: court, vote, will, state, elect
## Topic 9: mccain, campaign, palin, john, said
## Topic 10: report, hous, said, investig, white
## Topic 11: tortur, think, know, ’re, peopl
## Topic 12: obama, peopl, polit, american, one
## Topic 13: iraq, war, iraqi, troop, militari
## Topic 14: get, will, ’ll, can, one
## Topic 15: bush, presid, will, administr, said
## Topic 16: democrat, clinton, parti, republican, will
## Topic 17: mccain, tax, will, american, economi
## Topic 18: said, attack, polic, kill, terrorist
## Topic 19: one, film, will, time, like
## Topic 20: state, nation, unit, will, american
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.744, relative change = 7.990e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.740, relative change = 6.121e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.736, relative change = 4.909e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.733, relative change = 4.075e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.730, relative change = 3.538e-04)
## Topic 1: bill, democrat, republican, senat, vote
## Topic 2: citi, immigr, chicago, mayor, illeg
## Topic 3: obama, mccain, campaign, hillari, poll
## Topic 4: iran, obama, israel, nuclear, polici
## Topic 5: will, oil, price, year, compani
## Topic 6: media, like, news, one, say
## Topic 7: obama, senat, will, lieberman, democrat
## Topic 8: court, vote, state, will, law
## Topic 9: mccain, palin, campaign, john, said
## Topic 10: report, investig, hous, depart, said
## Topic 11: tortur, think, know, peopl, say
## Topic 12: obama, peopl, polit, american, one
## Topic 13: iraq, war, iraqi, troop, militari
## Topic 14: get, will, ’ll, doesn’t, one
## Topic 15: bush, presid, administr, said, will
## Topic 16: democrat, republican, parti, clinton, will
## Topic 17: tax, will, american, economi, mccain
## Topic 18: said, attack, polic, kill, terrorist
## Topic 19: film, one, time, like, will
## Topic 20: nation, will, american, world, secur
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.728, relative change = 3.073e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.726, relative change = 2.750e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.724, relative change = 2.493e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.722, relative change = 2.278e-04)
## ....................................................................................................
## Completed E-Step (5 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
#Then, as above, we prep the covariates using the estimateEffect() function -- but this time, we include the interaction variable.
prep <- estimateEffect(c(1) ~ rating*day, poliblogInteraction,
metadata=meta, uncertainty="None")
#Then, we use our plotting function
plot.estimateEffect(prep, covariate="day", model=poliblogInteraction,
method="continuous",xlab="Days", moderator="rating",
moderator.value="Liberal", linecol="blue", ylim=c(0,.08),
printlegend=F)
plot.estimateEffect(prep, covariate="day", model=poliblogInteraction,
method="continuous",xlab="Days", moderator="rating",
moderator.value="Conservative", linecol="red", add=T,
printlegend=F)
legend(0,.08, c("Liberal", "Conservative"),
lwd=2, col=c("blue", "red"))
Now let us use supplementary packages to visualize stm outputs.
stmprinter: Print multiple stm model dashboards to a pdf file for inspection. Beautiful automated reports from multiple stm runs.
stminsights: A Shiny Application for Inspecting Structural Topic Models. A shiny GUI with beautiful graphics.
themetagenomics: Exploring Thematic Structure and Predicted Functionality of 16s rRNA Amplicon Data. STM for rRNA data.
tidystm: Extract (tidy) effects from estimateEffect. Makes it easy to make ggplot2 graphics for STM.
stmgui: Shiny Application for Creating STM Models. This is a Shiny GUI for running basic STM models.
stmBrowser: An R Package for the Structural Topic Model Browser. This D3 visualization allows users to interactively explore the relationships between topics and the covariates estimated from the stm package in R.
stmCorrViz: A Tool for Structural Topic Model Visualizations. This package uses D3 to generate an interactive hierarchical topic explorer.
if (!requireNamespace("pacman"))
install.packages('pacman')
library(pacman)
packages<-c("stmprinter","stminsights","themetagenomics", "tidystm","stmgui",
"stmBrowser","stmCorrViz")
p_load(packages,character.only = TRUE)
#devtools::install_github("mikajoh/stmprinter")
#devtools::install_github("mikajoh/tidystm", dependencies = TRUE)
#devtools::install_github("mroberts/stmBrowser",dependencies=TRUE)
Let us use stmBrowser to visualize our topic models. Check here for more details: https://github.com/mroberts/stmBrowser. The major function is stmBrowser.
## Inspecting valid thresholds via grid search. Progress:
## (text progress bar output omitted)
You can check here for the generated html visualization file.
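For reference, a call that produces such an interactive browser looks roughly like this. A sketch following the stmBrowser README: the argument names and the use of the truncated `shortdoc` column created earlier are assumptions; the data frame passed to `data=` must contain the covariate columns:

```r
library(stmBrowser)
# Export an interactive D3 browser for the fitted model, with 'rating'
# and 'day' as covariates and 'shortdoc' as the display text
stmBrowser(poliblogPrevFit, data = data,
           covariates = c("rating", "day"),
           text = "shortdoc")
```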
Using TidyText with STM to Fit a Topic Model
So far, we have walked through the stm tutorial. Now let us switch back to tidytext. Again, you can check the tidy text mining book for details on how to fit stm models.
Let us make a document-term matrix first.
library(tidytext)
tidy_data <- data %>%
mutate(docid=row_number()) %>%
unnest_tokens(word,documents) %>%
anti_join(get_stopwords())
tidy_data %>%
count(word, sort = TRUE)
## # A tibble: 113,862 x 2
## word n
## <chr> <int>
## 1 obama 29094
## 2 mccain 20601
## 3 one 15537
## 4 said 13819
## 5 people 11302
## 6 new 11274
## 7 campaign 10909
## 8 just 10899
## 9 can 10849
## 10 like 10116
## # … with 113,852 more rows
What are the highest tf-idf words in our documents? Let us plot them.
data_tf_idf <- tidy_data %>%
count(docid, word, sort = TRUE) %>%
bind_tf_idf(word, docid, n) %>%
arrange(-tf_idf) %>%
group_by(docid) %>%
top_n(10) %>%
ungroup
data_tf_idf %>%
filter(docid%in%c(10,100,500,1000,1500,5000,6000,10000)) %>%
mutate(word = reorder_within(word, tf_idf, docid)) %>%
ggplot(aes(word, tf_idf, fill = docid)) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ docid, scales = "free", ncol = 2) +
scale_x_reordered() +
coord_flip() +
theme(strip.text=element_text(size=11)) +
labs(x = NULL, y = "tf-idf",
title = "Highest tf-idf words in a Document")
Let us fit an stm model.
library(quanteda)
library(stm)
dfm <- tidy_data %>%
count(docid, word, sort = TRUE) %>%
cast_dfm(docid, word, n)
topic_model <- stm(dfm, K = 20,
verbose = FALSE, init.type = "Spectral")
Let us plot beta, the topic-word distribution. Here is the plate diagram of stm, in case you are not familiar with it.
STM model
td_beta <- tidy(topic_model)
td_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, ncol=4,scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(x = NULL, y = expression(beta),
title = "Highest word probabilities for each topic",
subtitle = "")
Now let us plot gamma, the probability that each document is generated from each topic.
td_gamma <- tidy(topic_model, matrix = "gamma",
document_names = rownames(dfm))
ggplot(td_gamma, aes(gamma, fill = as.factor(topic))) +
geom_histogram(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, ncol = 5) +
labs(title = "Distribution of document probabilities for each topic",
subtitle = "",
y = "Number of documents", x = expression(gamma))
Lab 2 Problem Set
Fit an stm model for the raw_docs on Chinese food and American football provided for the LDA analysis at the beginning. Create a covariate marking the first 5 sentences as "China" and the second 5 sentences as "America."
Compare the differences in model outputs between lda and stm.
Send me a screenshot before Tuesday 6 PM.
The End